This report examines how selection fairness is influenced by the item characteristics of a selection instrument in terms of its distribution of item difficulties, level of item discrimination, and degree of item bias. Computer simulation was used in the administration of conventional ability tests to a hypothetical target population consisting of a minority and a majority subgroup. Fairness was evaluated by three indices which reflect the degree of differential validity, errors in prediction (Cleary's model) and proportion of applicants exceeding a selection cutoff (Thorndike's model). Major findings were: (1) tests with a uniform distribution of difficulties had fairness properties generally superior to tests having a peaked distribution of item difficulties; (2) subgroup validity differences can be expected to occur when test items are biased against one of the subgroups; (3) when differential prediction is used, the Thorndike model reflects varying degrees of unfairness due to item bias and other test characteristics, while the Cleary and validity models do not; and (4) differential prediction provides fairer selection than the use of majority prediction only, regardless of the internal characteristics of the test, although substantial degrees of unfairness still exist under certain test item configurations. It was concluded that the internal characteristics of a selection instrument will affect the fairness of test scores in specific applications and that further research is needed to delineate which testing strategies and/or item characteristics are optimal in reducing unfairness. (Author)
Effects of Item Characteristics on Test Fairness

Mental ability testing is commonly used in education, industry and the military services to select and place individuals. Test results are also used in research as a basis for making inferences about the intellectual endowment of various individuals and groups. However, many of these tests have often been cited as being biased and/or unfair to certain subgroups of the general population, including Blacks, Spanish-speaking Americans and Native Americans. Because of the prevalence of testing in our society, and because of the possible discriminatory nature of some tests, there has recently been an increase in research on the nature and degree of test bias and test fairness in various settings, including examination of various ways to reduce test bias and unfairness where they exist.

A necessary prerequisite for carrying out meaningful research in this area is to define exactly what is meant by bias and unfairness. Over the last ten years, a number of models have been proposed to provide such definitions. Many of these models are quite different in philosophy and purpose. A useful taxonomy often suggested (Flaugher, 1974; McNemar, 1975; V^ti^ & Weiss, 19?^) is to separate models of bias from models of fairness. The essential distinction is that models of bias represent the psychometric properties of a particular set of test items or test scores. Models of test fairness typically are concerned with the impact a test will have when used in a particular application. The application most often considered is the selection or placement of personnel.

However, there is a direct relationship between the item characteristics of a test, including the degree of its bias, and its fairness when used in a selection program. Although substantial amounts of research have dealt with the effects of item characteristics on test validity (Brogden, 1946; Gulliksen, 1945; Tucker, 1946; Urry, 1969), no efforts have been made to study the effects of item characteristics on test fairness. Even for validity, the effects of possible bias in the test items have not been considered.

There are a number of possible reasons for this lack of research. First, selection fairness models are relatively new. Second, empirical investigation in this area is often expensive, impractical due to the relative unavailability of minority group members, and hampered by the absence of a suitable, unbiased criterion measure. Furthermore, in selection fairness models, tests are considered only in terms of their final scores. Therefore, the internal properties of a test are generally ignored. This approach is detrimental to the development of tests which might be designed to reduce unfairness.

This report offers a general method for examining the relationship between selection and placement fairness and characteristics of test items. This is accomplished by conceptualizing bias and fairness in terms of latent trait theory. Criterion performance is represented by the latent trait. Item bias and other item characteristics are expressed in terms of latent trait parameters. This approach eliminates the possibility that the criterion itself may be biased, and permits direct observation of how the characteristics of a test affect the prediction of a criterion and, in turn, selection fairness.

Bias and Fairness



Bias, as it is used in this report, refers to those subgroup differences in the psychometric properties of a test which occur as a result of factors extraneous to those which a test is intended to measure. For example, mean test score differences between Blacks and Whites on a vocabulary test would be considered evidence of bias if these differences reflected the influence of cultural factors. In this case, the cultural factors would be extraneous since, presumably, the test is intended to measure verbal ability.



Most of the models of bias which have been proposed (Angoff & Ford, 1973; Breland, Stocking, Pinchak, & Abrams, 1974; Echternacht, 1974) have involved comparing item difficulties among subgroups. According to this approach, a test is considered biased if its items do not have roughly the same relative difficulties for all subgroups. An item within the test is said to be biased if it is relatively more difficult for a given subgroup than are most of the other test items. Other models of bias which have been proposed involve subgroup comparison of item discriminations, mean test scores, and factor loadings (e.g., Angoff, 1975; Atkin, Bray, Davison, Herzberger, Humphreys, & Selzer, 1976; Jensen, 1975).




Regardless of the specific model used, the existence of bias cannot by itself be taken as prima facie evidence that a test is unfair. For example, a test which includes a substantial proportion of Black slang words may be unfair when used to select college freshmen, but fair when used to select social workers for employment in the Black community. Clearly then, the fairness of a test (or test item) can only be determined by examining what caused the bias and what its eventual impact will be in a specific application.

For the specific application of tests to the selection of personnel, a number of formal definitions of fairness have been developed. One of the earliest formal definitions of test fairness in selection was based on the concept of validity. This is undoubtedly due to the fact that early legal challenges to the use of tests for personnel selection questioned test validity.

Validity model of test fairness. The validity model is primarily concerned with the legitimacy of the inferences which can be made about people's ability or performance in a specific situation based on their test scores. The validity of a test is frequently determined by calculating the correlation coefficient between the test scores and scores on an appropriate criterion for a particular subgroup. Fairness of a testing procedure has been evaluated in terms of whether there is a significant difference between the validity coefficients for various subgroups on a given test. If a significant difference does exist, this would imply that the predictions made on the basis of the test scores are not as accurate for one subgroup as for another.

In a selection situation, such a difference in validity would have several adverse effects on the subgroup having the lower correlation. First, it would decrease the variance of the predicted score distribution. Assuming the selection cutoff to be above the mean of this subgroup, as it normally would be, such a decrease in variance would lower the probability that these individuals would exceed the selection cutoff. Secondly, the lower correlation coefficient indicates that the test does not order individuals as accurately on the criterion as it would for a subgroup having a higher validity coefficient. Consequently, if selection is based on predicted criterion performance, applicants with lower average ability will be selected from the subgroup having the lower predictive validity even in cases where the subgroups have equal mean ability.
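The first effect described above can be illustrated numerically. The following is a minimal sketch, assuming predicted scores are normally distributed with a standard deviation equal to the validity coefficient times the criterion standard deviation (the usual linear-regression result); the function name and the example values are hypothetical and are not taken from the report.

```python
from statistics import NormalDist

def prob_exceeding_cutoff(validity, cutoff, mean=0.0, criterion_sd=1.0):
    """P(predicted score > cutoff), assuming predicted scores are normal
    with SD = validity * criterion SD (an illustrative assumption)."""
    predicted_sd = validity * criterion_sd
    return 1.0 - NormalDist(mean, predicted_sd).cdf(cutoff)

# Hypothetical values: cutoff half a criterion SD above the subgroup mean.
higher_validity = prob_exceeding_cutoff(validity=0.6, cutoff=0.5)
lower_validity = prob_exceeding_cutoff(validity=0.3, cutoff=0.5)
print(round(higher_validity, 3), round(lower_validity, 3))
```

With the cutoff above the subgroup mean, the subgroup with the lower validity has the smaller probability of exceeding it, which is the direction of the effect described in the text.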

Whether or not meaningful validity differences among subgroups occur in real selection situations is an empirical issue which has received a great deal of attention recently. The weight of the evidence (Campbell, Crooks, Mahoney, & Rock, 1973; Farr, O'Leary, Pfeiffer, Goldstein, & Bartlett, 1971; Schmidt, Berner, & Hunter, 1973) seems to indicate that meaningful differences occur with very low frequency. However, a number of issues remain unresolved regarding how to statistically test for a subgroup validity difference and what to do if it is statistically significant (e.g., Standards for Educational and Psychological Tests, 1974; Flaugher, 1974).

Although research still continues on differential validity as a means of evaluating test fairness, it appears that validity is a necessary, but not a sufficient condition for test fairness. In recognition of this fact, a number of specific models have been proposed for defining fairness in the context of selection.

Other models of fairness. In the context of selection, test fairness is directly interpretable in terms of the number of applicants who are selected from each subgroup of testees. Test bias influences fairness to the extent that if a test is biased, it will often produce an adverse impact on the subgroup against which it is biased. This, however, will depend on how fairness is defined and on other situational variables, such as the criterion for success and selection cutoff points.

When a test is used in the selection process, it is part of a decision strategy to select or reject potentially successful individuals for one or more available openings. Operationally, this is usually achieved by setting a cutoff score on the criterion to define successful performance, determining the corresponding predictor cutoff scores, and selecting applicants with predictor scores equal to or exceeding the predictor cutoff score.



It was previously indicated that a low test validity for a given subgroup, which is equivalent to a larger amount of random errors of prediction, can affect selection decisions by decreasing the probability that individuals from that subgroup would exceed a given cutoff on the criterion. Another factor which would affect the prediction of criterion performance in selection is constant errors of prediction. The random and constant errors of prediction can be respectively translated by regression theory into the slope and intercept of the regression line relating test scores to criterion performance.

Cleary (1968) developed a widely used definition of selection fairness, referred to by her as "bias", which involves the regression line in prediction. According to Cleary, "A test is biased for members of a subgroup of the population if, in the prediction of a criterion for which the test was designed, consistent nonzero errors of prediction are made for members of the subgroup" (Cleary, 1968). Theoretically, consistent zero errors of prediction are assured by employing separate within-subgroup regression lines, i.e., differential prediction. Therefore, the application of Cleary's definition is operationally equivalent to endorsing differential prediction in selection.

This fact can be demonstrated by considering the situation in Figure 1. Figure 1a illustrates the situation in which the mean criterion score for the minority subgroup ($\bar{Y}_{min}$) is equal to the mean criterion score for the majority subgroup ($\bar{Y}_{maj}$) but the mean test score of the majority subgroup ($\bar{X}_{maj}$) is greater than the mean test score ($\bar{X}_{min}$) of the minority subgroup. In this situation it is clear that use of within-subgroup regression lines, i.e., differential prediction, will produce consistent zero errors of prediction for both the minority and majority subgroups. However, using either the regression line of the majority subgroup or the regression line derived from data pooled across both subgroups will lead to underprediction of the minority subgroup.

A situation more commonly found in extant practice (Cleary, 1968; Gael, Grant, & Ritchie, 1975; Goldman & Richards, 1974; Kallingal, 1971; Temp, 1971) is where subgroups differ on both the criterion and test scores, as shown in Figure 1b. In this case, using either the majority or pooled regression line to predict minority criterion performance will result in overprediction for members of that subgroup.

In recent years, a number of models have been proposed as alternatives to Cleary's regression model of selection fairness (see Cole, 1973, and Petersen & Novick, 1974, for reviews). The one most frequently offered as an alternative to Cleary's model is Thorndike's (1971) Constant Ratio model. According to Thorndike, fair use of test scores requires that the acceptance levels should be set such that the ratio of the percentage of individuals who exceed a specified level of criterion performance to the percentage who exceed a cutoff on the predictor will be equalized among subgroups in the applicant population.

One of the primary conclusions that has derived from the research on test fairness is that the assessment of fairness will depend on how fairness is defined. Some of the models that have been proposed will lead to the selection of more minority applicants than will other models. If the models are ordered along the dimension of how many minority applicants are selected in a given situation, the Cleary and Thorndike models fall near the extremes. The Cleary model is the least favorable to minority subgroups, while the Thorndike model is one of the most favorable. Consequently, these two models make a convenient pair of strategies for evaluating the fairness of a test.



Purpose and Assumptions

Purpose. In their book on mental test theory, Lord and Novick (1968, p. 388) indicate how the item characteristics of a test can affect the shape of the distribution of test scores. As can be seen in Figure 1, selection fairness is a function of the parameters of the distribution of test scores. Therefore, if the item characteristics of a test can affect the shape of the test score distribution, they will also influence selection fairness. The purpose of this report is to examine the relationship between characteristics of test items and selection fairness, as reflected by several fairness models.




Figure 1
Relationships between criterion scores and test scores for majority and minority subgroups with unequal mean scores on the predictor variables
Specifically, the following questions are investigated:

1. How do the following characteristics of test items affect fairness?

a. Distribution of item difficulties.

b. Level of item discrimination.

c. Degree of item bias.

2. How is fairness affected by test length?

3. How does the assessment of fairness depend on the choice of a model for fairness?




Answers to these questions should be useful in indicating how a fair test should be constructed.

Assumptions. The above questions were investigated in the context of an assumed selection situation which was modeled by a Monte Carlo simulation study. The selection process consisted of administering a selection test to each applicant and using the score from that test to predict an external criterion represented by the known latent trait, $\theta$. The applicant population was assumed to consist of two subgroups having identical ability distributions on $\theta$. The selection instrument was assumed to be completely described in terms of its latent trait parameters so that each of its items could be described in terms of item discrimination, item difficulty, and probability of being guessed correctly by chance. Some of the items in the test, however, were assumed to be biased against the minority subgroup and the degree of their bias was expressed in terms of the latent-trait item parameters.




METHOD 
Independent Variables



Four of the independent variables were characteristics of the test administered to both the majority and minority subgroups simulated in this study. Three of these variables (distribution of item difficulties, level of item discrimination, and test length) are standard characteristics of tests. The fourth, item bias, reflected the major independent variable of interest in this study. The fifth independent variable was intended to vary the fairness in the application of test scores. This variable consisted of using only the regression equation from the majority subgroup, or differential prediction, for the prediction of a simulated criterion variable. Figure 2 summarizes the independent variables used in this study.



Test Variables



Only conventional tests were used in this study. That is, all simulated testees within an experimental condition were administered identical items in a fixed sequence. Test items were represented by a set of latent trait parameters (Lord & Novick, 1968), which described the essential statistical properties of each item. A test of length m with a given set of characteristics was generated by selecting the first m items from one of eighteen 100-item pools.
Each item pool represented one of the experimental conditions obtained from taking combinations of the three test variables summarized in Table 1. For all experimental conditions the guessing parameter, c, was set at .20. This value is the expected proportion correct if purely random guessing occurred on five-alternative multiple-choice items.



Table 1

Item Pool Parameter Specifications

Distribution of Difficulties      a       Bias
Uniform or Peaked                 .30      .5
Uniform or Peaked                 .30     1.0
Uniform or Peaked                 .30     2.0
Uniform or Peaked                 .70      .5
Uniform or Peaked                 .70     1.0
Uniform or Peaked                 .70     2.0
Uniform or Peaked                1.10      .5
Uniform or Peaked                1.10     1.0
Uniform or Peaked                1.10     2.0
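The way the item pools and experimental conditions multiply out can be sketched by enumerating the factor levels. This is only an illustrative enumeration, assuming the levels listed in Table 1 together with the test lengths and prediction methods described in this Method section; the variable names are hypothetical.

```python
from itertools import product

distributions = ["uniform", "peaked"]
discriminations = [0.30, 0.70, 1.10]      # a values as listed in Table 1
bias_levels = [0.5, 1.0, 2.0]
test_lengths = [10, 30, 50, 70, 100]
prediction_methods = ["majority", "differential"]

# One item pool per (distribution, a, bias) combination.
item_pools = list(product(distributions, discriminations, bias_levels))
# Each pool is administered at every test length.
test_conditions = list(product(item_pools, test_lengths))
# Each test condition is scored under both prediction methods.
all_combinations = list(product(test_conditions, prediction_methods))

print(len(item_pools), len(test_conditions), len(all_combinations))  # 18 90 180
```

The three counts correspond to the eighteen item pools, the 90 experimental conditions, and the 180 combinations of independent variables referred to in the text.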
Distribution of item difficulties. Tests were simulated which had either peaked or uniform distributions of item difficulties. The peaked distributions of difficulties (b) were randomly sampled from a normal distribution having a mean of $\bar{b}=0$ (where 0 indicates an item of average difficulty) and a standard deviation of 1.0. The uniform distributions of difficulties also had a mean of $\bar{b}=0$ but were randomly sampled from a uniform distribution which ranged from b=-2.99 to +2.99. The actual distribution of item difficulties used in each condition is summarized in Table A in the Appendix.

Item discriminations. Three levels of item discrimination were used within both the peaked and uniform tests. These three levels were a=.30, .70 and 1.00, corresponding to point-biserial correlations of items with total scores of .127, .373 and .482 respectively (assuming a population proportion passing of P=.6 and a guessing parameter of c=.2). Values of item discrimination were held constant within each testing condition and subgroup.

Test length. To study the effects of test length and its interaction with item difficulty distributions, item discrimination, and item bias on test fairness, five typical test lengths were used. Test lengths were 10, 30, 50, 70 and 100 items. Within each test length, discriminations were constant for a given peaked or uniform test and a specified degree of item bias.

Item bias. Item bias was defined as

$$\text{Bias} = b_{min} - b_{maj} \qquad [1]$$

where $b_{maj}$ and $b_{min}$ are the latent trait difficulty parameters for the majority and minority subgroups, respectively.

This definition of item bias was based on the assumption that the subgroups had identical true ability distributions on the trait being measured, but that items were more difficult for the minority subgroup because of some independent extraneous factor(s) which reduced their performance on the test items. For example, if a test was designed to measure verbal ability, the inclusion of "culturally loaded" items would result in a test which would be more difficult for a nondominant subgroup of a given culture. The result would be a test which would be biased against such minority subgroups. This definition of item bias is very similar to those often applied in practice (Angoff & Ford, 1973; Breland et al., 1974; Echternacht, 1974). The main difference is that previous models of item bias have been based on the proportion correct measure of item difficulty. However, proportion correct has been shown (Lord & Novick, 1968; Urry, 1974) to be confounded with guessing and item discrimination, whereas latent trait difficulty parameters are pure measures of item difficulty.

Three levels of item bias, based on Equation 1, were studied. These were .5, 1.0 and 2.0, indicating tests which were respectively more difficult for members of the simulated minority subgroup. Bias was introduced into the tests by adding this constant value to the difficulty parameters of the items selected to constitute the majority subgroup test. Item discrimination, guessing and test length were held constant as bias was introduced into the testing situation.
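The construction of an item pool and the introduction of bias can be sketched as follows. This is a minimal illustration under stated assumptions: the sampling ranges come from the text, but the function names, the seed, and the use of Python's `random` module are hypothetical and do not reproduce the generator used in the study.

```python
import random

def sample_difficulties(shape, n_items=100, rng=random):
    """Draw n_items latent-trait difficulties b for a 'peaked' or 'uniform' pool."""
    if shape == "peaked":
        return [rng.gauss(0.0, 1.0) for _ in range(n_items)]      # normal, mean 0, SD 1
    if shape == "uniform":
        return [rng.uniform(-2.99, 2.99) for _ in range(n_items)]  # rectangular
    raise ValueError(f"unknown shape: {shape}")

def add_bias(majority_difficulties, bias):
    """Minority difficulties: each majority difficulty shifted up by a constant."""
    return [b + bias for b in majority_difficulties]

rng = random.Random(1976)  # arbitrary seed, for reproducibility of the sketch only
b_maj = sample_difficulties("uniform", rng=rng)
b_min = add_bias(b_maj, bias=1.0)
```

Because the same constant is added to every item, each item is harder for the minority subgroup by exactly the bias level, while discrimination and guessing are left untouched.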
Prediction of Latent Ability

A raw test score was obtained for each simulated testee by summing the number of correct answers for that testee. Correct answers to the gth item were recorded as $v_g=1$, while incorrect answers were represented as $v_g=0$. Therefore, the raw test score for the ith individual was

$$x_i = \sum_{g=1}^{m} v_{gi} \qquad [2]$$

where m = test length.

Since the objective of the test was to obtain an estimate of the latent ability $\theta$, a method was needed to obtain a prediction of $\theta$ based on the test score $x_i$. Linear regression equations were used for this purpose. Two kinds of regression equations, majority and differential prediction, were used corresponding to two types of prediction procedures often mentioned in the literature (Bartlett & O'Leary, 1969; Goldman & Hewitt, 1975; Jones, 1973; and McNemar, 1975). One regression equation of each type (majority prediction and differential prediction) was developed within each of the eighteen testing conditions. The predicted ability scores generated by these regression equations were used to define the dependent variables.

Majority prediction. In this condition, the same regression equation

$$\hat{\theta}_i = \alpha + \beta x_i \qquad [3]$$

where $\alpha$ and $\beta$ are the regression parameters based on only the data from the majority subgroup, was used to predict the ability of all individuals regardless of subgroup membership.

Differential prediction. In this condition, separate within-subgroup regression equations were used to predict ability for individual i of subgroup j. These are given by

$$\hat{\theta}_{ij} = \alpha_j + \beta_j x_{ij} \qquad [4]$$

where $\alpha_j$ and $\beta_j$ were the within-subgroup regression parameters for subgroup j, where j referred to either the majority or minority subgroup.
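The two prediction procedures can be contrasted with a small worked sketch. It assumes ordinary least squares as the fitting method (the text says only that linear regression equations were used); the helper names and the toy data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares intercept and slope for y = alpha + beta * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta

def predict(x, alpha, beta):
    """Apply a fitted line to a list of raw scores."""
    return [alpha + beta * xi for xi in x]

# Hypothetical toy data: same true abilities, lower raw scores for the minority.
x_maj, theta_maj = [2, 4, 6, 8], [-1.0, -0.5, 0.5, 1.0]
x_min, theta_min = [1, 3, 5, 7], [-1.0, -0.5, 0.5, 1.0]

alpha_maj, beta_maj = fit_line(x_maj, theta_maj)
alpha_min, beta_min = fit_line(x_min, theta_min)

majority_pred_min = predict(x_min, alpha_maj, beta_maj)      # majority equation applied to minority
differential_pred_min = predict(x_min, alpha_min, beta_min)  # within-subgroup equation

mean_majority_pred = sum(majority_pred_min) / len(majority_pred_min)
mean_differential_pred = sum(differential_pred_min) / len(differential_pred_min)
```

In this toy case the majority equation underpredicts the minority mean, while the within-subgroup equation reproduces it, since a least-squares line always passes through the subgroup means.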

Dependent Variables 

The dependent variable in this study was test fairness. Fairness was evaluated by three indices separately for each of the 180 combinations of independent variables (i.e., item difficulty distribution x item discrimination x test length x bias x prediction method). The three fairness indices were: 1) a validity index, R; 2) a Cleary-type index, C; and 3) a Thorndike-type index, T. These fairness measures parallel their original definitions. But in this study the variable being predicted was $\theta$, the known true latent ability, as compared to the fallible external criterion usually used in research on test fairness.



In addition to studying the effects of item bias and other test characteristics on these three definitions of fairness, the effects of the independent variables on a number of standard distributional statistics were also studied. These included the mean, standard deviation, standard error of estimate, skewness, and kurtosis of the ability estimates, $\hat{\theta}$.



The R-Index

The correlation between estimated ability and the true latent ability, $r_{\hat{\theta}\theta}$, has been used in latent trait studies as a measure of the "goodness" of ability estimation (Brogden, 1946; Urry, 1969, 1971). In the present study the true ability, $\theta$, was taken as the criterion for selection. Therefore, $r_{\hat{\theta}\theta}$ can be interpreted as a coefficient of predictive validity. For simplicity, this coefficient of validity will be referred to simply as the R-Index.

Differences in R between the majority and minority subgroups were examined as an indication of test fairness. Larger correlations for one group as compared to the other, holding testing conditions constant, would indicate that a given set of testing conditions produced test scores with a greater potential for unfairness for the group having the lower correlation.

R was evaluated only for the majority prediction condition since the application of differential prediction amounts to a linear transformation of the majority prediction ability estimates, and correlation coefficients are unaffected by linear transformations.
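The invariance argument can be checked directly. The sketch below assumes a Pearson product-moment correlation and a positive regression slope; the data values and names are hypothetical.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

theta = [-1.2, -0.3, 0.1, 0.8, 1.5]               # hypothetical true abilities
majority_estimate = [-0.9, -0.4, 0.2, 0.5, 1.1]   # hypothetical majority-equation estimates
# A different intercept and positive slope, as a within-subgroup equation would give.
differential_estimate = [0.35 + 1.2 * e for e in majority_estimate]

r_majority = pearson_r(majority_estimate, theta)
r_differential = pearson_r(differential_estimate, theta)
```

The two correlations agree up to floating-point error, so computing R under only one prediction condition loses no information.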



The C-Index

Based on Cleary's (1968) concept of test bias, the degree of test bias in subgroup j can be defined as

$$C_j = \bar{\hat{\theta}}_j - \bar{\theta}_j \qquad [5]$$

where $\bar{\hat{\theta}}_j$ and $\bar{\theta}_j$ are the means of the ability distributions for the predicted and true distributions, respectively. When this definition is applied to the predicted abilities obtained from the differential prediction equation given in Equation 4, $C_j=0$ in all cases. This follows since $\bar{\hat{\theta}}_j$ will always equal $\bar{\theta}_j$. Consequently, the utilization of differential prediction will always result in a fair test usage according to the Cleary definition.



The inter-subgroup difference in the Cleary index is

$$C_{diff} = (\bar{\hat{\theta}}_{min} - \bar{\theta}_{min}) - (\bar{\hat{\theta}}_{maj} - \bar{\theta}_{maj}) = C_{min} - C_{maj} \qquad [6]$$
Since in the majority prediction condition (Equation 3)

$$\bar{\hat{\theta}}_{maj} = \bar{\theta}_{maj} \qquad [7]$$

Equation 6 simplifies to

$$C_{diff} = C_{min} \qquad [8]$$

Similarly, in the differential prediction condition (Equation 4)

$$\bar{\hat{\theta}}_{maj} = \bar{\theta}_{maj} \quad \text{and} \quad \bar{\hat{\theta}}_{min} = \bar{\theta}_{min} \qquad [9]$$

and Equation 6 simplifies to

$$C_{diff} = 0 \qquad [10]$$

for all cases. Consequently, the Cleary index, C, was also evaluated only in the majority prediction condition.
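Equations 5 and 6 translate directly into a few lines of code. This is a minimal sketch; the function names and the example numbers are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def cleary_index(predicted, true):
    """C_j: mean predicted ability minus mean true ability for one subgroup."""
    return mean(predicted) - mean(true)

def cleary_diff(pred_min, true_min, pred_maj, true_maj):
    """C_diff = C_min - C_maj."""
    return cleary_index(pred_min, true_min) - cleary_index(pred_maj, true_maj)

# Hypothetical example: majority predictions are unbiased on average,
# minority predictions are too low on average.
true_min, pred_min = [-0.5, 0.0, 0.5], [-0.6, -0.1, 0.2]
true_maj, pred_maj = [-0.5, 0.0, 0.5], [-0.4, 0.0, 0.4]

c_diff = cleary_diff(pred_min, true_min, pred_maj, true_maj)
```

A negative value indicates that, on average, the minority subgroup's ability is underpredicted relative to the majority subgroup's.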

The T-Index

Applying Thorndike's definition of fairness to the model used in this study, a test is fair if the following condition is met:

$$\frac{P(\hat{\theta}_{maj} > \theta_o)}{P(\hat{\theta}_{min} > \theta_o)} = \frac{P(\theta_{maj} > \theta_o)}{P(\theta_{min} > \theta_o)} \qquad [11]$$

where P is the proportion of testees who exceed the cutoff point $\theta_o$. In this study a cutoff equal to the mean of the majority subgroup, i.e., $\theta_o=0$, was used.

Since identical subgroup ability distributions were assumed, Equation 11 reduced to

$$\frac{P(\hat{\theta}_{maj} > \theta_o)}{P(\hat{\theta}_{min} > \theta_o)} = 1 \qquad [12]$$

or

$$P(\hat{\theta}_{maj} > \theta_o) = P(\hat{\theta}_{min} > \theta_o) \qquad [13]$$

If Equation 13 defines a fair selection situation, then the degree to which a test is unfair to the minority subgroup, as compared to the majority subgroup, is given by

$$T_{diff} = [P(\hat{\theta}_{min} > \theta_o) - P(\hat{\theta}_{maj} > \theta_o)] \times 100 \qquad [14]$$

or simply the difference between the percentage of individuals who exceed the selection cutoff in the minority and majority subgroups. The T-Index was evaluated in both the majority and differential prediction conditions.
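Equation 14 amounts to a difference of two percentages, which can be sketched as follows. The function names and the example predictions are hypothetical; the default cutoff of 0 follows the text.

```python
def percent_above(values, cutoff=0.0):
    """Percentage of predicted abilities strictly above the cutoff."""
    return 100.0 * sum(v > cutoff for v in values) / len(values)

def thorndike_diff(pred_min, pred_maj, cutoff=0.0):
    """T_diff: minority percentage above cutoff minus majority percentage."""
    return percent_above(pred_min, cutoff) - percent_above(pred_maj, cutoff)

pred_min = [-0.5, -0.2, 0.1, 0.3]   # hypothetical predicted abilities
pred_maj = [-0.4, 0.1, 0.2, 0.6]

t_diff = thorndike_diff(pred_min, pred_maj)
print(t_diff)  # -25.0
```

A negative value means a smaller percentage of the minority subgroup exceeds the cutoff, i.e., unfairness to the minority subgroup in the sense of Equation 13.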



Data Simulation 



Population

A collection of examinees from a target population was simulated with a computer by generating 500 random numbers which fell between the values of -3.34 and +3.24 sampled from a normal population having a mean=0 and a S.D.=1.0. Each of the random numbers represented the true ability, $\theta$, for one testee: $\theta=0$ indicated an individual of average ability, while high positive values of $\theta$ indicated a person of very high ability on the relevant trait. Since the same population distribution was used for both the majority and minority subgroups, the degree of unfairness which occurred as a result of the characteristics of the test items would be manifested as differences between the predicted distributions of $\theta$ for the two subgroups. Similarly, the same 500 values of $\theta$ were used within each of the 90 experimental conditions. In this way, differences observed in the dependent variables could be attributed solely to action of the independent variables.
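One way to generate such a population is sketched below. It assumes the reported limits act as bounds enforced by rejection sampling; the text may instead simply be reporting the observed range of an unconstrained sample, so this is an interpretation rather than the study's procedure. The function name and seed are hypothetical.

```python
import random

def sample_abilities(n=500, low=-3.34, high=3.24, seed=0):
    """Draw n standard-normal abilities, discarding any outside [low, high]."""
    rng = random.Random(seed)   # fixed seed so the same values can be reused
    thetas = []
    while len(thetas) < n:
        theta = rng.gauss(0.0, 1.0)
        if low <= theta <= high:
            thetas.append(theta)
    return thetas

# The same 500 abilities serve as both the majority and minority subgroup.
abilities = sample_abilities()
```

Reusing one fixed list of abilities for every condition mirrors the design choice in the text: any difference between subgroups must then come from the test, not from the people.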



Simulation Procedure

The procedure used to simulate testing was carried out in three stages: 1) response vector generation, 2) application of test models to response vectors, and 3) calculation of statistics and fairness indicants.

Response generation. Generation of test responses followed procedures similar to those used by Betz & Weiss (1973), Vale & Weiss (1975) and McBride and Weiss (1976). This procedure, based on latent trait test theory (Lord & Novick, 1968), requires two assumptions. The first assumption was local independence of responses, which requires that the probability that a testee of ability $\theta$ will answer any item correctly is independent of whether that testee answers any other item correctly. Stated mathematically, this assumption becomes

$$f(v_1, v_2, \ldots, v_n \mid \theta) = \prod_{i=1}^{n} f_i(v_i \mid \theta) \qquad [15]$$

where $f$ and $f_i$ are probability density functions, i refers to one of the n items, and $v_i=0$ if a response was incorrect, and $v_i=1$ if correct.




The second assumption was that a response, $v_i$, depended only on 1) the ability of the examinee, and 2) the characteristics of the test items, as described by each item's latent trait parameters a, b and c.

With these assumptions, the response vectors were generated by:

1. Calculating P_i(θ), the probability of answering item i correctly given
   θ, from the normal ogive version of the latent trait test model,

       $P_i(\theta) = c_{ij} + (1 - c_{ij}) \int_{-\infty}^{z_{ij}(\theta)} \phi(t)\,dt$    [16]

   where z_ij(θ) = a_ij(θ - b_ij),
   φ(t) is the normal density function,
   j indicates subgroup membership (majority or minority), and
   c_ij = .20.

2. Determining the response v_i by:

   a. Generating a random number r drawn from a uniform distribution,
      0<r<1.

   b. If r>P_i(θ), v_i=0.

   c. If r≤P_i(θ), v_i=1.

3. Repeating this process for each item used, and for each subgroup. Two
   vectors of item responses were generated for each ability level for each
   item pool, one for the minority subgroup and one for the majority subgroup.
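The generation steps above can be sketched in Python. This is only an illustration under stated assumptions, not the program used in the study: the function names, the use of Python's `random` module, the three-item pools, and the way the minority pool is derived (each difficulty shifted by an arbitrary amount) are all hypothetical choices made here.

```python
import math
import random

def normal_cdf(z):
    """Cumulative normal distribution: the integral of phi(t) from -inf to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_correct(theta, a, b, c=0.20):
    """Three-parameter normal ogive: c + (1 - c) * Phi(a * (theta - b))."""
    return c + (1.0 - c) * normal_cdf(a * (theta - b))

def generate_response_vector(theta, items, rng):
    """Return 0/1 responses for one testee of ability theta.

    items is a list of (a, b, c) tuples; rng is a random.Random instance.
    """
    responses = []
    for a, b, c in items:
        r = rng.random()  # step 2a: one uniform random number per item
        # steps 2b and 2c: incorrect if r exceeds P_i(theta), else correct
        responses.append(1 if r <= p_correct(theta, a, b, c) else 0)
    return responses

# Hypothetical item pools. Shifting b for the minority pool is an assumption
# used only to give the two subgroups different parameters.
majority_items = [(0.7, -1.0, 0.20), (0.7, 0.0, 0.20), (0.7, 1.0, 0.20)]
minority_items = [(a, b + 0.5, c) for a, b, c in majority_items]

rng = random.Random(1976)
theta = 0.0  # a testee of average ability
vectors = {
    "majority": generate_response_vector(theta, majority_items, rng),
    "minority": generate_response_vector(theta, minority_items, rng),
}
```

With c=.20, a testee whose ability equals the item difficulty has a probability of .20 + .80(.50) = .60 of answering correctly, which is what `p_correct(0.0, 1.0, 0.0)` returns.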



Test administration. The response vectors served as input to a program
which simulated the testing process. Since only conventional tests were
simulated in this study, the program selected items sequentially from one of the
eighteen combinations of item parameters. This process was repeated for each of
the five test lengths within each combination of the other sets of item parameters.
Varying test lengths were obtained by selecting the first m items out of the 100
items available, where m was the desired test length (10, 30, 50, 70 or 100 items).
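The layout of the design can be sketched as follows. The report states only that there were eighteen item-parameter combinations and five test lengths; factoring the eighteen into two difficulty distributions, three discrimination levels and three bias levels is an inference from the levels reported in the results, and all names below are hypothetical.

```python
from itertools import product

DISTRIBUTIONS = ("uniform", "peaked")
DISCRIMINATIONS = (0.3, 0.7, 1.1)
BIAS_LEVELS = (0.5, 1.0, 2.0)
TEST_LENGTHS = (10, 30, 50, 70, 100)

def first_m_items(pool, m):
    """Conventional test of length m: the first m items of the pool, in order."""
    if m > len(pool):
        raise ValueError("test length exceeds pool size")
    return pool[:m]

# Every combination of item parameters crossed with every test length.
conditions = list(product(DISTRIBUTIONS, DISCRIMINATIONS, BIAS_LEVELS, TEST_LENGTHS))

pool = list(range(100))  # placeholder identifiers for a 100-item pool
tests = {m: first_m_items(pool, m) for m in TEST_LENGTHS}
```

Because each test is a prefix of the same pool, the shorter tests are nested inside the longer ones, so differences between test lengths are not confounded with differences in item content.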

Application of fairness models. The output of the second stage of the
simulation was an estimated θ, θ̂, for each examinee for each test condition.
Therefore, a distribution of true and estimated θ values was produced for each
subgroup for each of the 90 experimental conditions. Within each of these test
conditions, the mean, standard deviation, skewness and kurtosis were computed for
the θ̂ variable, and the validity, Cleary, and Thorndike measures of fairness
(i.e., R, C, T) were calculated.
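A minimal sketch of the three indices, computed for one subgroup, is given below. It assumes that R is the Pearson correlation between θ and θ̂, that C is the mean of θ̂ minus the mean of θ (so that a negative value means underprediction), and that T is the percentage of examinees whose θ̂ exceeds a cutoff at the population mean of 0. The function names and the example values are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def r_index(theta, theta_hat):
    """Validity: Pearson correlation between true and estimated ability."""
    mt, mh = mean(theta), mean(theta_hat)
    sxy = sum((t - mt) * (h - mh) for t, h in zip(theta, theta_hat))
    sxx = sum((t - mt) ** 2 for t in theta)
    syy = sum((h - mh) ** 2 for h in theta_hat)
    return sxy / math.sqrt(sxx * syy)

def c_index(theta, theta_hat):
    """Mean estimated ability minus mean true ability."""
    return mean(theta_hat) - mean(theta)

def t_index(theta_hat, cutoff=0.0):
    """Percentage of examinees whose estimated ability exceeds the cutoff."""
    return 100.0 * sum(1 for h in theta_hat if h > cutoff) / len(theta_hat)

# Hypothetical subgroup whose estimates are uniformly half a unit too low.
theta = [-1.0, 0.0, 1.0, 2.0]
theta_hat = [-1.5, -0.5, 0.5, 1.5]

r = r_index(theta, theta_hat)
c = c_index(theta, theta_hat)
t = t_index(theta_hat)
```

In this example the ordering of examinees is preserved perfectly (R=1.0), yet the subgroup mean is underpredicted (C=-.5) and only 50% of the subgroup is predicted above the cutoff, which illustrates why the three indices can disagree.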



RESULTS

Distributions of Predicted Scores

Means, standard deviations, skewness, and kurtosis indices of ability
estimates as a function of the experimental conditions are given for a test length
of 50 items in Table 2; results for test lengths of 10, 30, 70 and 100 items,
which generally parallel those for 50 items, are given in Appendix Tables B through
E. In these tables, the statistics for the true ability distribution (θ) are
given in the first row of the table, listed under the "True" group heading. In
the standard deviation column, values obtained when differential prediction (D.P.)
was used are given as well as values for the majority prediction (M.P.) case.
Since differential prediction did not affect any of the other statistics, only
one set of values is shown.

As Table 2 shows, increasing item bias caused the mean of the minority
subgroup to be underpredicted. The degree of underprediction increased both with
increasing item bias and with increasing item discrimination. For low item
discrimination, the degree of underprediction was less than the degree of item bias
introduced, with the degree of underprediction being somewhat larger for the
peaked test at each of the item bias levels. At high item discrimination (a=1.1),
the degree of underprediction became essentially equal to the degree of bias at
the .5 and 1.0 levels of item bias. With item bias equal to 2.0, the degree of
underprediction (-1.85 and -1.52 for the uniform and peaked tests, respectively)
more closely approached the degree of bias than did the degrees of
underprediction (-1.34 and -1.10) in the low item discrimination condition at this
same bias level. Also, at the high item discrimination level, underprediction
was somewhat smaller for the peaked test at a given bias level.



Table 2

Score Distribution Characteristics of Conventional Tests of Length 50 as a
Function of Discrimination (a), Bias, and Group, for Uniform and Peaked Tests

                       Mean          Standard Deviation          Skewness      Kurtosis
                                    Uniform        Peaked
a     Bias  Group   Unif. Peaked   M.P.   D.P.   M.P.   D.P.  Unif. Peaked  Unif. Peaked

            True    -.07?  -.07?  1.006  1.006  1.006  1.006   -.01   -.01    .22    .22

.30         maj     -.07?  -.07?   .798   .798   .807   .807    .03   -.11    .00   -.06
      .5    min     -.391  -.?02   .822   .805   .811   .820    .0?    .01   -.09   -.00
      1     min     -.709  -.738   .819   .810   .82?   .822    .08    .10   -.16   -.02
      2     min    -1.336 -1.097   .819   .815   .800   .816    .28    .33    .1?    .17

.70         maj     -.07?  -.07?   .9?0   .9?0   .9?6   .9?6   -.08   -.11   -.27   -.66
      .5    min     -.?96  -.535   .938   .939   .9?2   .9?9    .10    .19   -.33   5.66
      1     min     -.953  -.89?   .929   .93?   .903   .9?1    .31    .?9   -.19   -.38
      2     min    -1.7?9 -1.708   .823   .920   .719   .896    .57   1.13    .08   1.08

1.1         maj     -.07?  -.07?   .967   .967   .959   .959   -.03   -.10   -.25  -1.13
      .5    min     -.536  -.536   .978   .965   .937   .?57    .20    .36   -.24   -.93
      1     min    -1.020  -.?56   .9?8   .960   .8?5   .937    .30    .86   -.33   -.07
      2     min    -1.852 -1.523   .813   .939   .5?0   .835    .77   2.13    .42   4.86

Note. M.P. is majority prediction equation; D.P. is differential prediction equation.
A "?" marks a digit that is illegible in the source copy.



The standard deviation of the ability distribution was generally
underpredicted, using majority prediction, both for the majority and minority subgroups.
For the uniform test, the degree of underprediction was reduced for both groups
as item discrimination increased, while for the peaked test, underprediction
increased for the minority subgroup while it decreased for the majority subgroup.
Within the peaked test, the degree of underprediction of the standard
deviations became especially severe with increasing item bias at high
discriminations.

When differential prediction was used, however, the degree of
underprediction of the standard deviation was substantially reduced. Even at a=1.1 for
the peaked test, underprediction of standard deviation for the minority
subgroup was virtually the same as for the majority subgroup, except for very high
(2.0) levels of item bias.

The skewness for both the uniform and peaked tests increased in a positive
direction as both item bias and item discrimination increased. This effect was
much larger for the peaked test. At a=1.1 and bias of 2.0, the peaked test had
a skewness of 2.13 compared to .77 for the uniform test. The kurtosis measure
indicated that the shape of the distribution changed from being somewhat flat
(negative value) to being peaked (positive value) as item bias was increased;
the degree of this change was a function of increasing item discrimination.
Again, the uniform test, when compared to the peaked test, more closely
maintained its resemblance to the true normal distribution as bias was increased.
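The shape statistics discussed above can be illustrated with a short sketch. The report does not state which estimators were used; the moment-based formulas below are an assumption, chosen because they give a kurtosis of zero for a normal distribution, negative values for flat distributions and positive values for peaked ones, which matches the interpretation in the text. The function names and example data are hypothetical.

```python
def central_moment(xs, k):
    """k-th central moment, dividing by n (population form)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    """Third standardized moment; positive when the right tail is longer."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3; negative = flat, positive = peaked."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

symmetric = [-1.0, 0.0, 1.0]          # flat and symmetric
right_tailed = [0.0, 0.0, 0.0, 4.0]   # most scores low, one high score
```

A positively skewed distribution of θ̂, such as the one the peaked test produced for the minority subgroup at high bias, has most estimates piled up at the low end with a thin tail of high estimates, like `right_tailed` here.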

Validity: R-Index

Effects on majority subgroup. The validity coefficients for the uniform (U)
and peaked (P) distributions of item difficulties are shown in Table 3. The
three rows in Table 3 labeled "maj" give the validities for the majority
subgroup for the three values of item discrimination. These results correspond to
the case where item bias is zero.

Validity was found to increase as item discrimination and test length
increased for both types of item distributions. At the lower discrimination levels,
a=.30 and .70, the peaked distribution gave higher values of validity; but at the
high discrimination level, a=1.1, and for test lengths longer than about 40, the
advantage reversed and the uniform distribution gave higher validities. The
highest validity found was R=.981 for the uniform distribution of item
difficulties at a=1.1, for a test length of 100 items. The validity for peaked tests at
this same point was R=.967. The lowest validity also occurred for the uniform
distribution. At a=.30 for test length=10, R=.493, while R=.540 for the peaked
distribution.

Validity differences. A major concern with respect to test fairness refers
not only to how validity varies as a function of the test characteristics for a
given subgroup, but more importantly, how validity varies differentially among
subgroups. The reason for this is that if a difference in subgroup validities
does exist, this would imply that the predictions made on the basis of the test
scores are not as accurate for one subgroup as for the other. As was explained
in the introduction, such a difference in validity would have several adverse
effects on the subgroup having the lower correlation. Therefore, the effects of
item bias on validity were studied by comparing the validities for both
subgroups for all the item pools and test lengths. To facilitate this analysis,
differences between subgroup validities were determined. Differential validity
was thus defined as

    $R_{\mathrm{diff}} = R_{\mathrm{min}} - R_{\mathrm{maj}}$    [17]

A negative value of differential validity indicates that the majority subgroup
had a larger validity coefficient than the minority subgroup. These values
appear in Table 3 in the rows designated "diff".

Table 3

Validity Coefficients for Uniform (U) and Peaked (P) Conventional Tests at Five Test Lengths
as a Function of Item Discrimination (a) and Item Bias, for Majority Group (maj) and for
Minority Group (min), and Differences in Validities (diff) for the Two Groups

Table 3 shows that for the lowest a-value, validity differences were very
small for low levels of item bias. As item bias increased, differential validity
increased for the uniform test, but decreased for the peaked test, except at 100
items where differential validity decreased for both tests. At a=.30,
differential validity tended to be positive, in favor of the minority subgroup, for both
types of tests. But for item discriminations of a=.7 and 1.1 for test lengths of
30 and above, the direction of differential validity was reversed and the tests
became unfair to the minority subgroup. As the degree of item bias and item
discrimination increased, the size of this negative differential became
substantial, particularly for the peaked test. This effect was present at all test
lengths above 10 items. For example, the peaked test at a=1.1, test length=100,
and bias=2.0, had a .125 difference between the subgroup validities, in favor of
the majority subgroup. The largest negative differential validity for the
uniform tests was R_diff=-.0?5, which occurred at a=1.1, test length=50, bias=2.0.

C-Index 

The Cleary-type fairness measure was defined as the difference between
the means of the true ability, θ, and the predicted ability, θ̂. Therefore, the
C-Index is in the same units as θ. The population distribution of θ had a mean
of 0 and a standard deviation of 1.0. A negative C-Index implies unfairness for
a subgroup. In Figure 3, C_diff, the subgroup differences in the C-indices
(C_maj - C_min), are plotted against test length for both the uniform and peaked
tests for all item pools in the majority prediction condition. As indicated by
Equation 8, under the assumptions of the present study, C_diff depends only on
C_min, since C_maj=0. Numerical values of C by subgroup are shown in Appendix
Table F. C_diff was not computed under differential prediction since, as
indicated earlier, by definition it is always equal to zero in this condition.

As would be expected, the C-Index indicated increased unfairness for the
minority subgroup as item bias was increased from .5 to 2.0. Unfairness also
increased as a negatively accelerating function of test length, reaching its
highest value at a test length of 100. The rate of increase as well as the
highest value varied as a function of item discrimination and degree of item bias.
For both the uniform and peaked tests, increasing item bias tended to increase
the rate of increase of C with test length within a level of item discrimination.
The effect of test length decreased as item discrimination increased.

There appeared to be very little difference between the peaked and uniform
distribution of difficulties on C_diff at the a=.3 and .7 levels of item
discrimination. The differences which do occur appear to favor the uniform tests at the
.5 and 1.0 bias levels at a=.3 (Figure 3a) and .7 (Figure 3b), and the peaked
tests at bias of 2.0 when a=.7. However, at the highest discrimination level,
a=1.1 (Figure 3c), the uniform tests were more unfair than the peaked tests to
the minority subgroup when the degree of item bias was large (2.0). For an item
bias of 2.0, differences of .350 and .342 were found between the subgroup C
values for test lengths of 70 and 100 items. Thus for this test situation, using
peaked instead of uniform distributions of difficulty would produce an average
estimate of ability with a decrease in item bias of more than one-third of a
standard deviation relative to the population of true abilities.

Figure 3
C-Index as a function of item discrimination (a),
item bias, and test length, using majority prediction



T-Index

T_diff can be defined as the difference between the T-indices for the
majority and minority subgroups, i.e., T_min - T_maj. A negative T_diff indicated that
the percent predicted to be above average was smaller for the minority than for
the majority subgroup; i.e., the test was less fair to the minority subgroup.




Majority prediction. As Figure 4 shows, using majority prediction, T_diff
varied in a complex way as a function of item discrimination, test length and
degree of item bias, for the uniform and peaked tests (numerical values are
given in Appendix Table G). In general, however, the uniform tests were less
unfair to the minority subgroup at the a=.3 and .7 levels of item
discrimination (Figure 4a and 4b, respectively), but showed no clear advantage at the 1.1
level (Figure 4c) except for the shortest test length. Regardless of item
discrimination or degree of item bias, the shortest and longest test lengths of
the uniform test resulted in relatively greater fairness. Only for the
intermediate test lengths did the peaked test sometimes produce a smaller T_diff than
did the uniform test, and then usually at the higher discrimination levels. In
contrast to the C-Index, unfairness measured by T_diff did not increase as a
regular function of test length for the peaked test at item discrimination levels
above a=.30.

Figure 4
T-Index as a function of item discrimination (a),
item bias, and test length, using majority prediction

The largest difference in T_diff between uniform and peaked tests was 11.2%,
occurring at the highest bias and discrimination levels at test length=10
(Figure 4c). Even for the a=.70, bias=.5, test length=100 (Figure 4b), a test
which might be representative of one used in real selection situations, the
uniform test would have led to the selection of 12.8% fewer minority applicants.

Differential prediction. The results of using differential prediction on
T-fairness are shown in Figure 5; numerical values are in Appendix Table H.
Since T_diff for the differential prediction case used the same T value for the
majority subgroup, but a different value for the minority subgroup, results from
the two prediction situations directly show the reduction in unfairness due to
differential prediction. A comparison of Figure 4 with Figure 5 thus shows that
the main effect of using differential prediction was that a much larger
percentage of minority applicants was predicted above average than was the case when
majority prediction was used. Consequently, the general level of unfairness was
reduced using differential prediction.

Figure 5 shows that with differential prediction, the minority subgroup
sometimes had a greater percentage of examinees above the mean than did the
majority subgroup. This is a situation which never occurred in the majority
prediction case (Figure 4). For the most part, this overprediction for the
minority subgroup occurred almost entirely for the uniform test and tended to
decrease as test length and item discrimination increased. Overprediction
virtually disappeared for test lengths greater than 30, at item discriminations of
a=.70 and 1.1 (Figures 5b and 5c, respectively). On the average, both the
peaked and uniform tests tended to give higher negative values of T_diff as item
discrimination increased, indicating increased unfairness, even using
differential prediction. This effect was particularly pronounced for the peaked
tests; the unfairness of uniform tests was less affected by increasing item
discrimination.

The uniform tests, with only one exception, produced values of T_diff that
were less negatively biased than the peaked tests. This superiority of the
uniform tests increased as the degree of both bias and item discrimination
increased. The difference was particularly large for tests of shorter length.
For a=1.1, bias=2.0 and test length=10, there was a difference between the
uniform and peaked tests of 23.4% in the percentages of minority testees
predicted to be above average.



DISCUSSION 



Effects of Item Characteristics on Validity 

There has been considerable previous research (Brogden, 1946; Cronbach &
Warrington, 1952; Gulliksen, 1945; Lord, 1952; Tucker, 1946) on the
relationship between item statistics and test validity. It generally has been shown that
the best distribution of item difficulties for maximizing validity, i.e.,
correlation with underlying true ability, depends on a number of factors including
the level of item discrimination. However, other things being equal (e.g., the
ability distribution peaked near the difficulty level of the items), a higher
validity will be achieved with a peaked distribution of item difficulties than
with a uniform distribution of item difficulties, unless items with very high
discriminations are employed. This result has led many test constructors to
recommend the general use of peaked tests, since the level of item
discrimination at which the uniform test gives higher validity was generally thought to be
too high to occur in realistic testing situations.

However, most of the previous research was conducted using conventional item
statistics. It has been shown (Lord, 1975; Urry, 1974) that conventional item
statistics confound the effects of guessing with item difficulty. When guessing
effects are properly accounted for by using latent trait parameters, the level
of item discrimination at which the uniform test produces higher validity is well
within the range which occurs in common practice. This result was first reported
by Urry (1969, p. 140; 1974) and was reaffirmed in the present study.

At discrimination levels of a=.3 and .7, corresponding to point-biserial
correlations of item response and total score of .187 and .373, respectively,
the peaked test produced a higher validity, although its advantage over the
uniform test tended to decrease with increasing test length. These results are
similar to what has been reported with conventional item statistics. However, at
the a=1.1 level of discrimination (corresponding to point-biserials of .48) and
for tests of 50 items or more, the uniform test produced higher validities than
the peaked test. For a 100-item test at a=1.1, validity was .981 for the uniform
test compared to .967 for the peaked test. This represents a substantial
increase in validity at this high level of correlation. Therefore, it would appear
that the uniform test might be preferable in many practical situations.

Effects of item bias. When test items were biased against the minority
subgroup, validity generally decreased as item bias increased (except at low item
discrimination levels) for both peaked and uniform tests. This effect produced
validity differences between the minority and majority subgroups since items were
unbiased relative to the majority subgroup. Furthermore, these validity
differences increased at a given level of item bias as item discrimination increased.
The implication of these results is that if items are biased, increasing item
discrimination can decrease test fairness as reflected by subgroup validity
differences.

Different types of tests produced different levels of unfairness as measured
by the validity index. Where item discrimination was at least a=.7, the uniform
test was clearly superior to the peaked test in producing a fair test. The
advantage of using the uniform test increased with increasing item discrimination
and test length. With a peaked test, at a=1.1 and a test length of 100, the
minority subgroup had a validity .125 below that of the majority subgroup. Under
these conditions, there was only a .021 difference in subgroup validities when a
uniform test was used.

These results have several implications for the construction of tests and for
the interpretation of existing test data. First, they offer a possible
explanation for the often-reported but controversial phenomenon of differential validity.
Several researchers (Campbell et al., 1973; Farr et al., 1971; Schmidt, Berner, &
Hunter, 1973) have presented arguments, based on various analyses of empirical
data, that differential validity does not exist as a substantive phenomenon. The
results of this study indicate that differential validity is a definite possibility
and, in fact, can be expected when test items are biased against one of the
subgroups being tested. The fact that validity differences are not often detected
in practice may be due to the problem of generating sufficient statistical power
to detect a difference when it exists (Bartlett, Bobko, & Pine, in press).

Thus, if test items are biased, differential validity is the expected result.
Furthermore, the usual practice of selecting items having the highest item
discriminations will have the effect of increasing subgroup validity differences,
particularly in peaked tests.

Other Models of Selection Fairness 

In the context of this study, the C-Index, based on Cleary's fairness model,
gave the degree of statistical bias in the estimation of a known criterion value.
The T-Index, based on Thorndike's definition of fairness, reflected the impact
of estimator bias on the percentage of applicants predicted to exceed some
qualifying point of ability, in this case, the mean of the population.

The Cleary view of fairness tends to optimize selection from the vantage
point of the selecting institution since it assures that the ablest candidates
will be selected. The Thorndike model tends to be more liberal from the viewpoint
of the minority subgroup. Even in situations where the Cleary index indicates a
perfectly fair test, it has been previously shown by Schmidt & Hunter (1974) that
the Thorndike index may still indicate unfairness. This result was replicated in
the present study.

Furthermore, both models indicated that the nature of a test, in terms of its
spread of item difficulties, can have a strong effect on fairness at some levels
of item discrimination and for some test lengths. For the levels of
discrimination and test lengths most commonly found in practice, the general finding was
that the peaked test was fairer in terms of the C-Index, while the uniform test
was fairer in terms of the T-Index, when majority prediction was employed.

The differential prediction condition indicated the conservative nature of
the C-Index. By definition, in this condition, all tests were perfectly fair by
the Cleary model. Yet the T-Index indicated the presence of substantial
unfairness, particularly for very short tests and for highly discriminating tests.
Furthermore, with differential prediction of ability, the uniform distribution of
item difficulties predicted more minority testees to be above average across
nearly all conditions than did the peaked distribution of item difficulties.

C-Index. One of the major trends in the data is shown in Figure 3; for
both the peaked and uniform tests, the effect of item bias on the C-Index
increased with test length. This implies that the shorter a test is, the more fair
it will be in terms of producing a smaller underprediction of the minority ability
level. In other words, shorter tests are less sensitive (more robust) to the
presence of item bias than are longer tests. Unfortunately, this finding runs
contrary both to conventional wisdom and to the results from the validity index,
which indicated an increase in validity with increasing test length.

The reason for this seemingly paradoxical result is that the longer a test
is, the more chance there is for bias to affect the final test score. For
example, if a test is only one item long, the only possible test scores are 0 and
1. Therefore, there is not as much opportunity for bias to affect the test score.
On the other hand, if a test is very long, even a small degree of bias can be
reflected in the score.

The influence of test length on fairness as measured by the C-Index was
reduced, however, by increasing the level of item discrimination. What this
implies is that the length of a test plays a much larger role in the ultimate
fairness of a test at the lower levels of discrimination than it does at the higher
levels. For example, Figure 3 indicates that if item bias is relatively large
(2.0), the extent to which the minority subgroup is underpredicted will vary
from 1 to 1.5 standard deviations as test length increases from 30 to 100 items.
At the highest level of discrimination, however, the increase in underprediction
is relatively constant between these test lengths. Consequently, in order to
achieve a high level of validity and the smallest possible underprediction of the
minority subgroup, the highest possible level of item discrimination should be
maintained, particularly for short tests.

If a test uses highly discriminating items, the distribution of item
difficulties will become an important factor in test fairness as measured by the
C-Index. For highly discriminating items, if there is reason to suspect a
relatively high degree of item bias, the results of this study indicate that a
peaked test is to be preferred over a uniform test. Unfortunately, this
conclusion conflicts with the findings based on the validity data where it was
found that a uniform test produced the smallest difference in validities with
highly discriminating items. Apparently, a decision must be made as to which
criterion is most important in a given situation: reduction in the difference
between subgroup validities, or reduction in the underprediction of the minority
subgroup.

In making this decision, the test constructor must carefully consider the
degree of precision which must be sacrificed in order to reduce the relative
degree of unfairness to a minority subgroup. Some minimum degree of precision
must surely be maintained or one could end up with a perfectly fair, but totally
useless selection instrument. This situation would, for example, be approached
by employing very short tests using items with very low discrimination.

T-Index. As was the case with the C-Index, increasing average item
discrimination had the overall effect of increasing unfairness as measured by the
T-Index. The relationship between fairness measured by the T-Index and test
length, however, was more complicated than it was when fairness was measured by
the C-Index. For some levels of item discrimination, T-fairness increased with
test length, while in other cases it decreased. In general, however, the fairest
tests were the shortest tests using the least discriminating items. This is the
same result found for the C-Index and was, again, probably due to the restriction
in the number of unique scores possible and the increased unreliability
characteristic of a short test.

Results for the T-Index indicated that the uniform test was consistently
less adversely affected by item bias than was the peaked test for the lower
levels of item discrimination. However, at higher item discriminations, neither
test design was obviously favorable.

Implications. Some generalizations about test design can be made based on
these results. Specifically, at moderate levels of item discrimination and test
lengths above 50 items, uniform tests are clearly superior to peaked tests in
terms of reducing unfairness. This conclusion holds for all three fairness
indices. At high item discrimination levels (above a=1.1), where uniform and
peaked tests produced conflicting results in terms of validity and C-fairness,
the distinction between distribution of item difficulties and fairness is less
clear. At these levels, the distribution of item difficulties does not seem to
make much difference as long as the tests are at least moderately long (greater
than 30 items). Also, at these high levels of item discrimination, the expected
loss in relative test validity for the minority subgroup would be small.
Therefore, in view of the superiority of peaked tests in terms of C-fairness under
these conditions, they would generally be preferable to the uniform tests.

Differential Prediction 

When differential prediction is used, a test will always be fair in terms
of Cleary's definition of fairness. That is, there will be no overprediction or
underprediction of mean ability level for that subgroup. Similarly, within the
model used in this study, the use of differential prediction will not be
reflected in the R-Index since it amounts to adding a constant to the scores of
the minority subgroup. Such a constant will not change the correlation of test
scores with another variable.
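The argument above can be checked numerically. The sketch below treats differential prediction as adding a single constant to the minority subgroup's estimates, as the text describes; the data, the names, and the choice of the constant (the one that removes the mean difference) are hypothetical and serve only as an illustration.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def percent_above(xs, cutoff=0.0):
    return 100.0 * sum(1 for x in xs if x > cutoff) / len(xs)

# Hypothetical minority subgroup: true abilities and underpredicted estimates.
theta = [-1.0, 0.0, 1.0, 2.0]
theta_hat_mp = [-2.0, -1.2, -0.1, 0.9]  # stand-in for majority prediction

# Constant that removes the mean underprediction (stand-in for differential prediction).
shift = mean(theta) - mean(theta_hat_mp)
theta_hat_dp = [h + shift for h in theta_hat_mp]

r_mp = pearson_r(theta, theta_hat_mp)
r_dp = pearson_r(theta, theta_hat_dp)
c_dp = mean(theta_hat_dp) - mean(theta)
t_mp = percent_above(theta_hat_mp)
t_dp = percent_above(theta_hat_dp)
```

Here the shift leaves the correlation unchanged and brings the mean difference to zero, while the percentage above the cutoff rises from 25% to 50%. This is why, of the three indices, only the T-Index can register a change under differential prediction.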

However, a test may be unfair according to the Thorndike definition of
unfairness in the differential prediction condition. The degree of unfairness
will depend on the item discrimination level, test length and distribution of
item difficulties. As was the case for C-fairness and for T-fairness using
majority prediction, differential prediction was accompanied by an overall
decrease in fairness to the minority subgroup as average item discriminations
increased. The relationship between T-fairness and test length, however, was
much more pronounced in the differential prediction case. The distribution of
item difficulties also had a much larger effect in the differential prediction
condition.

The most interesting effect was due to distribution of item difficulties.
The uniform tests resulted in scores which were more fair to the minority
subgroup than were scores on the peaked tests for almost all test lengths and
degrees of item bias. The differences in T-fairness between the uniform and
peaked tests were particularly large at the shortest and longest test lengths.
At the highest level of item discrimination (a=1.1), the uniform tests showed a
clear and substantial advantage over the peaked tests.

The differences that occurred between the uniform and peaked tests in the
differential prediction condition were mainly due to the skewness and kurtosis
of the predicted score distributions obtained in the respective conditions. As
can be seen in Table 2, the uniform tests produced a predicted score distribution
that was flatter and less skewed than that of the peaked tests. These
differences in the shape of the predicted score distributions increased as item
discrimination was increased.
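
For readers who want to compute the two shape statistics discussed here, the following is a minimal sketch using moment-based definitions. The report does not state which estimators it used; reporting kurtosis as excess kurtosis (zero for a normal distribution) is an assumption, and `shape_statistics` is an illustrative name.

```python
def shape_statistics(scores):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m3 = sum((s - mean) ** 3 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis

# A symmetric, flat set of scores: zero skewness, negative excess kurtosis.
flat_skew, flat_kurt = shape_statistics([-2.0, -1.0, 0.0, 1.0, 2.0])

# A set with a long upper tail: positive skewness.
tail_skew, tail_kurt = shape_statistics([0.0, 0.0, 0.0, 1.0, 5.0])
```

Under these definitions a flatter distribution has lower kurtosis, and a distribution with a long upper tail has positive skewness.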

The effect of the shape of the predicted score distribution is much greater
in the differential prediction condition than in the majority prediction
condition because of the relationship between the means of the score
distributions and the selection cutoff. These effects can be seen in
Figures 4 and 5. Figure 4 represents the case where majority prediction was used
and the test items were biased against the minority subgroup. This situation
will result in the mean predicted ability of the minority subgroup being below
that of the majority subgroup. Since, in this case, such a small percentage of
the minority subgroup is above the majority subgroup average, differences in the
predicted distributions as a function of spread in item difficulties have a
relatively small effect on T-fairness. However, with differential prediction
(Figure 5) there will be no bias in the predicted means for either subgroup.
Consequently, the effects of skewness and kurtosis on T-fairness are much larger.

When differential prediction was used, the uniform test was fairer to
the minority subgroup than was the peaked test. This result was observed across
test length and item discrimination conditions. For the higher discrimination
levels, this result was consistent with the results from the validity data.
Therefore, uniform tests are clearly preferable when used in combination with
differential prediction. These results also imply that if differential
prediction is employed, it is possible to avoid the problem, often encountered
using majority prediction, of trying to simultaneously minimize differential
validity and C- or T-unfairness.



SUMMARY AND CONCLUSIONS 

This study was concerned with how test fairness, defined in terms of test
validity and the models presented by Cleary and Thorndike, is influenced by
test length, distribution of item difficulties, level of item discrimination,
and degree of item bias. The methodology involved computer simulation in which
bias and fairness were represented in the context of latent trait theory. This
approach eliminates many of the criterion measurement problems often present in
empirical validation studies, and allows direct observation of the influence of
item characteristics on test scores and on predictions made from those test
scores. The situation assumed in the present study was that a single test was
used to select an unrestricted sample of applicants from a hypothetical
population consisting of a minority and a majority subgroup. The criterion on
which the selections were validated was a unidimensional variable on which the
subgroups had identical distributions.
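
The kind of simulation summarized above can be sketched as follows. This is an illustration under stated assumptions rather than the study's program: it assumes a two-parameter logistic item response function, represents item bias as a shift added to item difficulty for the disadvantaged subgroup, and uses an illustrative difficulty range of -3 to 3. None of these details is specified in this section, and all function names are hypothetical.

```python
import math
import random

def p_correct(theta, a, b, bias=0.0):
    """Assumed 2PL logistic model; bias shifts the item difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - (b + bias))))

def uniform_difficulties(n_items, low=-3.0, high=3.0):
    """Evenly spaced difficulties; the range is illustrative only."""
    step = (high - low) / (n_items - 1)
    return [low + i * step for i in range(n_items)]

def peaked_difficulties(n_items, center=0.0):
    """All difficulties at one point (an idealized peaked test)."""
    return [center] * n_items

def expected_score(theta, a, difficulties, bias=0.0):
    """Expected number-correct score for one examinee."""
    return sum(p_correct(theta, a, b, bias) for b in difficulties)

def simulate_score(theta, a, difficulties, bias, rng):
    """Draw one number-correct score by sampling each item response."""
    return sum(rng.random() < p_correct(theta, a, b, bias)
               for b in difficulties)

rng = random.Random(0)
items = uniform_difficulties(30)
unbiased = expected_score(0.0, 0.7, items)          # examinee at theta = 0
biased = expected_score(0.0, 0.7, items, bias=1.0)  # same ability, biased items
observed = simulate_score(0.0, 0.7, items, 1.0, rng)
```

Two examinees of equal ability receive different expected scores once the bias shift is applied, which is the mechanism by which item bias enters the score and predicted ability distributions in a setup of this kind.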

Validity 

The findings from the validity data indicated that, contrary to the results
of previous research, a uniform test often led to a higher validity for many
practical test applications than did a peaked test. In fact, if item
discriminations were relatively high, uniform tests resulted in substantially
higher validities than did peaked tests. More importantly, with respect to the
issue of test fairness, the difference between subgroup validities could be
reduced by using uniform rather than peaked tests. It was also found that
validity differences, such as those reported and often disputed in the testing
literature, are to be expected when test items are biased against one of the
applicant subgroups. The fact that such validity differences are not always
found in empirical validation studies is probably due to the lack of power in
the statistical tests used in these empirical investigations.

Selection Fairness Models 

The shapes of both the subgroup score distributions and the predicted ability
distributions were found to be very much affected by the characteristics of the
items included in the selection instrument. Conclusions drawn from each of the
models used for measuring selection fairness were a function of the predicted
ability distributions. Consequently, selection fairness was found to be a
function of a test's item characteristics as well.

Perhaps the most relevant finding for test construction was that certain
combinations of item characteristics were more robust in the presence of item
bias than were others. That is, item bias had less of an effect on fairness for
some combinations of item discrimination, test length, and distribution of item
difficulties than for others. The relationships among these variables were
very complex. In any practical application where it is necessary to know how a
particular set of item characteristics will affect the fairness of a test, a
simulation study should be implemented in which the conditions of the application
are approximated as closely as possible.

Nevertheless, certain generalizations can be made based on the present
results. If applicants are to be selected in a situation similar to the
conditions assumed in this study, a test having a uniform spread of item
difficulties will result in fairer predictions than will a peaked test, if a
reasonably high level of item discrimination can be maintained. Also, the
differential prediction model can be expected to provide fairer selection than
will sole reliance on majority prediction equations. Furthermore, the advantages
of using a uniform test will be enhanced in the differential prediction
application.

The results from the differential prediction condition indicate the
conservative nature of Cleary's fairness model as compared to Thorndike's model.
The use of differential prediction results in tests that are perfectly fair
according to the Cleary definition, yet substantial amounts of unfairness were
indicated in terms of the Thorndike model. This is a phenomenon often reported
in the literature on models of fairness; different models of fairness can
sometimes lead to divergent implications about the fairness of a test in a given
selection situation. Particularly when peaked tests are employed, these two
fairness models will lead to different conclusions.

Future Research 

The present study investigated only a limited class of test instruments.
The conventional tests used are characterized by their use of an identical fixed
sequence of items for all testees. Recently, a number of adaptive testing models
have been developed as alternatives to the conventional model (see Weiss, 1974).
In adaptive tests, items are selected on an individual basis for each testee.
Research with adaptive tests (e.g., McBride & Weiss, 1976; Vale & Weiss, 1975)
has shown that they result in different score distributions than do conventional
tests, with true ability held constant. Consequently, adaptive testing methods
might result in different degrees of fairness in test scores. Future research
should explore the fairness properties of adaptive testing models, and compare
them with those of conventional tests.

APPENDIX



Table A

Means and Standard Deviations of
Item Difficulty Distributions of Item Banks

PEAKED TEST

                    Item Discrimination (a)
Test           .3              .7              1.1
Length       M    S.D.       M    S.D.       M    S.D.
10         .00     .09     .03     .12    -.07     .11
30        -.01     .08     .02     .11    -.02     .10
50         .00     .09    -.01     .11     .00     .10
70         .01     .09     .00     .11     .00     .10
100        .00     .09     .01     .11     .00     .10

UNIFORM TEST

                    Item Discrimination (a)
Test           .3              .7              1.1
Length       M    S.D.       M    S.D.       M    S.D.
10        -.32    1.82    -.32    1.82    -.32    1.82
30        -.02    1.79    -.02    1.79    -.02    1.79
50        -.07    1.71    -.07    1.71    -.07    1.71
70        -.17    1.70    -.17    1.70    -.17    1.70
100       -.13    1.77    -.13    1.77    -.13    1.77



Table B

Score Distribution Characteristics for Conventional Tests of Length 10, as a
Function of Discrimination (a), Bias, and Group, for Uniform and Peaked Tests

                             Mean       S.D., Uniform       S.D., Peaked            Skewness            Kurtosis
a    Bias  Group    Uniform    Peaked      M.P.      D.P.      M.P.      D.P.   Uniform    Peaked   Uniform    Peaked
           True       -.074     -.074     1.006     1.006     1.006     1.006      -.01      -.01       .22       .22
.30        maj        -.074     -.074      .496      .496      .544      .544      -.23      -.14      -.22 [missing]
     .5    min        -.198     -.198      .507      .495      .548      .546      -.24      -.05      -.08 [missing]
     1     min        -.328     -.338      .507      .515      .554      .588      -.13      -.01      -.13      -.37
     2     min        -.583     -.604      .500      .526      .528      .543       .09       .30      -.16      -.28
.70        maj        -.074     -.074      .750      .750      .787      .787      -.27      -.17      -.32      -.70
     .5    min        -.357     -.393      .763      .749      .777      .801      -.21       .22      -.34      -.62
     1     min        -.875     -.697      .786      .769      .751      .806      -.01       .59      -.37      -.35
     2     min       -1.215    -1.216  [illeg.]      .777      .613      .761       .34      1.14      -.27       .88
1.1        maj        -.074     -.074      .825      .825      .874      .874      -.41      -.1?       .12     -1.06
     .5    min        -.435     -.485      .880      .?34      .877      .885      -.22       .32      -.30     -1.00
     1     min        -.813     -.854      .920      .849      .810      .?58      -.14       .81      -.41      -.31
     2     min       -1.554    -1.372      .869 [missing]      .570      .758       .50      2.00      -.35      3.92

Note. M.P. is majority prediction equation; D.P. is differential prediction equation.



Table C

Score Distribution Characteristics for Conventional Tests of Length 30, as a
Function of Discrimination (a), Bias, and Group, for Uniform and Peaked Tests

                             Mean       S.D., Uniform       S.D., Peaked            Skewness            Kurtosis
a    Bias  Group    Uniform    Peaked      M.P.      D.P.      M.P.      D.P.   Uniform    Peaked   Uniform    Peaked
           True       -.074     -.074     1.006     1.006     1.006     1.006      -.01      -.01       .22       .22
.30        maj        -.074     -.074      .729      .729      .745      .745       .01      -.12       .16      -.18
     .5    min        -.332     -.329      .741      .746      .746      .759       .04      -.04       .14      -.05
     1     min        -.601     -.608      .738      .748      .745      .767       .04       .07       .04       .02
     2     min       -1.107    -1.097      .745      .754      .721      .764       .18       .28       .06       .24
.70        maj        -.074     -.074      .904      .904      .917      .917      -.12      -.13      -.05      -.61
     .5    min        -.451     -.508      .894      .904      .905      .923       .04       .23      -.15      -.61
     1     min        -.875     -.911      .896      .896      .873      .923       .22       .52      -.14      -.26
     2     min       -1.607    -1.598      .809      .885      .715      .866       .57      1.13       .23      1.10
1.1        maj        -.074     -.074      .938      .938      .945      .945      -.15      -.12       .16     -1.10
     .5    min        -.507     -.526      .966  [illeg.]      .9?8      .947       .03       .34      -.07      -.95
     1     min        -.975     -.935      .976      .937      .847      .926       .12       .82      -.37      -.07
     2     min       -1.834    -1.544      .822      .560      .920      .868       .65      2.09       .06      4.79

Note. M.P. is majority prediction equation; D.P. is differential prediction equation.



Table D

Score Distribution Characteristics for Conventional Tests of Length 70, as a
Function of Discrimination (a), Bias, and Group, for Uniform and Peaked Tests

                             Mean       S.D., Uniform       S.D., Peaked            Skewness            Kurtosis
a    Bias  Group    Uniform    Peaked      M.P.      D.P.      M.P.      D.P.   Uniform    Peaked   Uniform    Peaked
           True       -.074     -.074     1.006     1.006     1.006     1.006      -.01      -.01       .22       .22
.30        maj        -.074     -.074      .851      .851      .853      .853      -.00      -.10      -.13      -.04
     .5    min        -.429     -.445      .876      .858      .864      .865       .03       .04      -.21      -.04
     1     min        -.781     -.810      .878      .860      .872      .865       .09       .11      -.16      -.07
     2     min       -1.488    -1.480      .882      .861      .835      .860       .22       .34      -.09       .03
.70        maj        -.074     -.074      .960      .960      .960      .960      -.15      -.11      -.25      -.67
     .5    min        -.508     -.538      .967      .959      .953      .961      -.02       .22      -.32      -.64
     1     min        -.977     -.978      .966      .954      .908      .954       .20       .53      -.29      -.29
     2     min       -1.828    -1.732      .973      .938      .719      .916       .52      1.16      -.03      1.17
1.1        maj        -.074     -.074      .979      .979      .967      .967      -.19      -.11       .04     -1.10
     .5    min        -.551     -.541     1.003      .977      .952      .963      -.15       .37      -.32      -.90
     1     min       -1.048     -.973      .988      .857      .857      .942       .20       .88      -.42      -.02
     2     min       -1.945    -1.595      .879      .955      .551      .842       .68      2.13       .16      4.86

Note. M.P. is majority prediction equation; D.P. is differential prediction equation.



Table E

Score Distribution Characteristics for Conventional Tests of Length 100, as a
Function of Discrimination (a), Bias, and Group, for Uniform and Peaked Tests

                             Mean       S.D., Uniform       S.D., Peaked            Skewness            Kurtosis
a    Bias  Group    Uniform    Peaked      M.P.      D.P.      M.P.      D.P.   Uniform    Peaked   Uniform    Peaked
           True       -.074     -.074     1.006     1.006     1.006     1.006      -.01      -.01       .22       .22
.30        maj        -.074     -.074      .889      .889      .893      .893      -.06       .16      -.05       .07
     .5    min        -.456     -.479      .903      .892      .918      .902      -.00      -.05      -.13      -.02
     1     min        -.837     -.880      .904      .893      .914      .898       .05       .08      -.04      -.05
     2     min       -1.606    -1.638      .911      .891      .875      .895       .22       .28      -.09      -.03
.70        maj        -.074     -.074      .972      .972      .972      .972      -.14      -.13      -.24       .65
     .5    min        -.526     -.553      .971      .971      .969      .973       .03       .19      -.24      -.64
     1     min        -.993    -1.011      .962      .968      .922      .965       .19       .52      -.22      -.28
     2     min       -1.871    -1.782      .866      .955      .727      .930       .53      1.14       .00      1.08
1.1        maj        -.074     -.074      .987      .987      .972      .972      -.10      -.12      -.10     -1.07
     .5    min        -.554     -.548     1.004      .985      .960      .968       .06       .37      -.23      -.90
     1     min       -1.049     -.985      .981      .981      .663      .947       .19       .88      -.36      -.05
     2     min       -1.958    -1.616      .875      .966      .554      .846       .63      2.16       .14      5.03

Note. M.P. is majority prediction equation; D.P. is differential prediction equation.



Table F

C-Index for Uniform (U) and Peaked (P) Tests

                                                Test Length
                           10              30              50              70             100
a    Bias  Group        U       P       U       P       U       P       U       P       U       P
.30  0.0   maj       .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
     .5    min      -.124   -.124   -.258   -.255   -.317   -.328   -.355   -.371   -.382   -.405
           diff     -.124   -.124   -.258   -.255   -.317   -.328   -.355   -.371   -.382   -.405
     1.0   min      -.254   -.264   -.527   -.534   -.635   -.664   -.707   -.736   -.763   -.806
           diff     -.254   -.264   -.527   -.534   -.635   -.664   -.707   -.736   -.763   -.806
     2.0   min      -.509   -.530  -1.033  -1.023  -1.262  -1.286  -1.414  -1.406  -1.532  -1.564
           diff     -.509   -.530  -1.033  -1.023  -1.262  -1.286  -1.414  -1.406  -1.532  -1.564
.70  0.0   maj       .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
     .5    min      -.283   -.319   -.377   -.434   -.422   -.461   -.434   -.464   -.452   -.479
           diff     -.283   -.319   -.377   -.434   -.422   -.461   -.434   -.464   -.452   -.479
     1.0   min      -.586   -.623   -.801   -.837   -.879   -.894   -.903   -.904   -.919   -.937
           diff     -.586   -.623   -.801   -.837   -.879   -.894   -.903   -.904   -.919   -.937
     2.0   min     -1.141  -1.142  -1.533  -1.524  -1.675  -1.634  -1.754  -1.658  -1.797  -1.708
           diff    -1.141  -1.142  -1.533  -1.524  -1.675  -1.634  -1.754  -1.658  -1.797  -1.708
1.1  0.0   maj       .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
     .5    min      -.361   -.411   -.433   -.452   -.462   -.462   -.477   -.467   -.480   -.474
           diff     -.361   -.411   -.433   -.452   -.462   -.462   -.477   -.467   -.480   -.474
     1.0   min      -.739   -.780   -.901   -.861   -.946   -.882   -.974   -.899   -.975   -.911
           diff     -.739   -.780   -.901   -.861   -.946   -.882   -.974   -.899   -.975   -.911
     2.0   min     -1.480  -1.298  -1.760  -1.470  -1.778  -1.499  -1.871  -1.521  -1.864  -1.542
           diff    -1.480  -1.298  -1.760  -1.470  -1.778  -1.499  -1.871  -1.521  -1.884  -1.542
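
The C-Index entries can be related to the subgroup means in the score distribution tables. The sketch below rests on an inference from the tabled values, not on a definition quoted from the report: it treats the C-Index as the minority mean minus the majority mean of predicted ability. The numbers are the uniform-test means for length 10 and a = .30 from Table B; `c_index` is an illustrative name.

```python
def c_index(minority_mean, majority_mean):
    """Mean predicted ability of the minority minus that of the majority."""
    return round(minority_mean - majority_mean, 3)

majority_mean = -0.074                       # Table B, a = .30, maj row
minority_means = {0.5: -0.198, 1.0: -0.328, 2.0: -0.583}  # keyed by bias level
tabled_c_index = {0.5: -0.124, 1.0: -0.254, 2.0: -0.509}  # Table F, length 10, U

computed = {bias: c_index(mean, majority_mean)
            for bias, mean in minority_means.items()}
```

For these three bias levels the computed differences match the tabled C-Index values, which is consistent with the reading assumed here.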

Table G

T-Index for Uniform (U) and Peaked (P) Tests, Using Majority Prediction

                                                Test Length
                           10              30              50              70             100
a    Bias  Group        U       P       U       P       U       P       U       P       U       P
.30  0.0   maj       38.4    56.8    45.4    47.4    41.6    43.8    44.0    46.2    44.0    49.8
     .5    min       30.4    46.2    31.8    33.8    28.0    28.2    29.6    30.2    30.6    30.8
           diff      -8.0   -10.6   -13.6   -13.6   -13.6   -15.6   -14.4   -16.0   -13.4   -19.0
     1.0   min       21.6    37.4    20.4    21.4    18.0    17.2    18.2    19.0    17.2    19.2
           diff     -16.8   -19.4   -25.0   -26.0   -23.6   -26.6   -25.8   -27.2   -26.8   -30.6
     2.0   min       10.0    20.2     7.8     7.2     4.4     3.6     4.4     4.6     4.6     4.6
           diff     -28.4   -36.6   -37.6   -40.2   -37.2   -40.2   -39.6   -41.6   -39.4   -45.2
.70  0.0   maj       40.6    53.0    50.2    45.0    46.2    49.4    48.0    49.8    48.2    49.2
     .5    min       26.6    35.4    34.4    29.2    28.4    30.4    32.0    29.8    31.0    29.2
           diff     -14.0   -17.6   -15.8   -15.8   -17.8   -19.0   -16.0   -20.0   -17.2   -20.0
     1.0   min       16.4    22.8    19.6    15.0    14.8    15.6    16.2    15.4    15.4    14.8
           diff     -24.2   -30.2   -30.6   -30.0   -31.4   -33.8   -31.8   -34.4   -32.8   -34.4
     2.0   min        4.6     6.2     3.8     3.6     2.8     3.6     2.4     3.4     3.0     3.2
           diff     -36.0   -46.8   -46.4   -41.4   -43.4   -45.8   -45.6   -46.4   -45.2   -46.0
1.1  0.0   maj       40.0    52.6    47.2    46.8    47.8    47.6    47.8    48.4    48.8    50.0
     .5    min       25.2    35.2    29.6    29.4    28.0    29.2    27.8    29.2    29.0    29.2
           diff     -14.8   -17.4   -17.6   -17.4   -19.8   -18.4   -20.0   -19.2   -19.8   -20.8
     1.0   min       15.0    20.4    17.4    15.2    15.8    14.6    13.8    15.4    15.4    15.4
           diff     -25.0   -32.2   -29.8   -31.6   -32.0   -33.0   -34.0   -33.0   -33.4   -34.6
     2.0   min        3.4     4.8     3.0     3.4     2.4     3.4     2.2     3.0     2.6     2.6
           diff     -36.6   -47.8   -44.2   -43.4   -45.4   -44.2   -45.6   -45.4   -46.2   -47.4



Table H

T-Index for Uniform (U) and Peaked (P) Tests, Using Differential Prediction

                                                Test Length
                           10              30              50              70             100
a    Bias  Group        U       P       U       P       U       P       U       P       U       P
.30  0.0   maj       38.4    56.8    45.4    47.4    41.6    43.8    44.0    46.2    44.0    49.8
     .5    min       52.6    46.2    40.6    42.6    48.4    47.8    45.0    42.0    46.8    45.4
           diff      14.2   -10.6    -4.8    -4.8     6.8     4.0     1.0    -4.2     2.8    -4.4
     1.0   min       41.8    37.4    49.2    47.8    47.2    44.4    44.4    46.2    46.0    44.2
           diff       3.4   -19.4     3.8      .4     5.6      .6      .4     0.0     2.0    -5.6
     2.0   min       43.6    33.0    41.4    39.0    44.6    45.2    41.4    44.2    44.0    42.4
           diff       5.2   -23.8    -4.0    -8.4     3.0     1.4    -2.6    -2.0     0.0    -7.4
.70  0.0   maj       40.6    53.0    50.2    45.0    46.2    49.4    48.0    49.8    48.2    49.2
     .5    min       47.8    47.8    48.2    41.6    47.0    45.8    47.4    46.4    45.8    46.4
           diff       7.2    -5.2    -2.0    -3.4      .8    -3.6     -.6    -3.4    -2.4    -2.8
     1.0   min       49.8    46.0    45.0    41.4    47.0    44.8    45.6    41.8    47.6    42.8
           diff       9.2    -7.0    -5.2    -3.6      .8    -4.6    -2.4    -8.0     -.6    -6.4
     2.0   min       41.2    31.2    44.0    36.6    40.8    38.2    39.4    38.8    42.4    37.4
           diff        .6   -21.8    -6.2    -8.4    -5.4   -11.2    -8.6   -11.0    -5.8   -11.8
1.1  0.0   maj       40.0    52.6    47.2    46.8    47.8    47.6    47.8    48.4    48.8    50.0
     .5    min       46.0    43.6    43.8    42.0    45.6    44.0    48.2    43.2    47.4    45.0
           diff       6.0    -9.0    -3.4    -4.8    -2.2    -3.6      .4    -5.2    -1.4    -5.0
     1.0   min       48.4    36.2    48.8    39.6    42.6    39.6    46.6    38.4    45.6    39.4
           diff       8.4   -16.4     1.6    -7.2    -5.2    -8.0    -1.2   -10.0    -3.2   -10.6
     2.0   min       51.0    27.6    40.8    28.6    43.0    30.8    43.0    30.8    44.0    29.6
           diff      11.0   -25.0    -6.4   -18.2    -4.8   -16.8    -4.8   -17.6    -4.8   -20.4